
add PCI device interface and a vfio backend driver #487

Merged: 1 commit merged into lkl:master on Aug 28, 2020

Conversation

liva commented Mar 17, 2020

LKL is now able to manage PCI devices by itself, like DPDK!
This commit implements a PCI bus driver and a PCI device interface. The commit also includes a vfio-pci backend driver as a reference implementation, so no extra kernel module is needed if the PCI device is assigned to VFIO.

I believe the vfio backend driver fulfills the needs of most use cases, but users can inject wrapper functions for another userspace PCI framework such as uio-pci-generic or a handmade kernel module. In either case, the framework should provide physically contiguous memory for DMA, because the kernel and some drivers (e.g. NVMe) assume this memory is physically contiguous.
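
As a rough illustration of how such wrapper functions could be injected, the sketch below shows a hypothetical backend ops structure. Only the alloc() hook is confirmed later in this thread; the other members are assumptions about what a uio-pci-generic or custom backend would have to provide (config-space access, BAR mapping, DMA-capable memory):

    /* Hypothetical shape of a user-supplied PCI backend; member names
     * other than alloc() are illustrative assumptions, not this PR's API. */
    struct my_pci_backend_ops {
            void *(*alloc)(unsigned long size);   /* DMA-capable memory */
            void (*free)(void *addr, unsigned long size);
            int (*conf_read)(void *dev, int where, int size, void *val);
            int (*conf_write)(void *dev, int where, int size, void *val);
            void *(*map_bar)(void *dev, int bar, unsigned long *len);
    };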


liva force-pushed the x86 branch 2 times, most recently from db86eb1 to 7c06e0b on March 17, 2020 at 08:16
tavip (Member) commented Mar 19, 2020

Thanks for the PR!

I am seeing some failures on the x86_64 tests:

+ LKL_HIJACK_CONFIG_FILE=/tmp/tmp.6kZDin3Te8/hijack-test.conf /home/ubuntu/project/tools/lkl/bin/lkl-hijack.sh ip addr
ip: lib/posix-host.c:302: panic: Assertion `0' failed.
/home/ubuntu/project/tools/lkl/bin/lkl-hijack.sh: line 22: 23849 Aborted                 (core dumped) LD_PRELOAD=liblkl-hijack.so $*
  TEST       not ok   test_pipe_ping 

as well as compile errors on mingw:

lib/iomem.c:3:22: fatal error: sys/mman.h: No such file or directory

It would be great if you could run the tests in your environment to confirm whether they fail there as well.

I'll follow up in the next few days with a more in-depth review.

liva (Author) commented Mar 21, 2020

Hi Octavian! Thanks for the comment!
And yes, there is no doubt that there are a lot of things to do: running tests, redesigning the implementation, and so on. I will take care of these things before marking it "ready for review".
I'm really grateful for your support; please tell me if I missed something.

thehajime marked this pull request as ready for review on July 6, 2020 at 04:49
thehajime (Member) left a comment

Maybe change the title, as uio is no longer used?

The reason for using vfio instead would be nice to have in the PR comments.

liva changed the title from "initial implementation of uio" to "add PCI device interface and a vfio backend driver" on Jul 10, 2020
liva force-pushed the x86 branch 2 times, most recently from 1e2fc33 to dcfd93e on July 20, 2020 at 08:20
thehajime (Member) left a comment

left several comments.

arch/lkl/drivers/pci.c (review thread resolved)
    _memory_start = (unsigned long)lkl_ops->mem_alloc(mem_size);
    if (lkl_ops->pci_ops && lkl_ops->pci_ops->alloc) {
            _memory_start =
                    (unsigned long)lkl_ops->pci_ops->alloc(mem_size);
Member

When pci_ops is installed but not used, do we need to use pci_ops->alloc()?

Author

Actually, pci_ops->alloc() just calls mmap internally, so I updated the patch and introduced a new interface, lkl_ops->page_alloc(), whose implementation is equivalent to that of pci_ops->alloc().

Now the PCI-related functions are removed from the kernel memory allocation path.
https://github.com/lkl/linux/pull/487/files#diff-65cbb8e45dedadd081421ddc9002d12fR17
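
For reference, a minimal sketch of what such an mmap-based page allocator could look like on a POSIX host (the names lkl_page_alloc/lkl_page_free are illustrative, not the exact symbols in this PR):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Allocate page-aligned, page-backed memory; returns NULL on failure. */
    static void *lkl_page_alloc(unsigned long size)
    {
            void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            return p == MAP_FAILED ? NULL : p;
    }

    static void lkl_page_free(void *addr, unsigned long size)
    {
            munmap(addr, size);
    }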

Member

LGTM; asking @tavip for more comments, if any.

Member

Could you please elaborate on why page_alloc is needed? The only user I see is in bootmem_init, and there we already make sure that we skip host memory until it is page aligned.

Author

Good point! In this scenario (with the vfio backend), we can just use memory allocated by malloc().
Meanwhile, in other cases, I believe LKL should provide a memory allocation interface other than host_ops->alloc() for PCI DMA. For example, an architecture without an IOMMU requires physically contiguous pages for bootmem, which is impossible to allocate with malloc(). Of course, developers can override host_ops->alloc() for this, but that results in inefficient memory consumption because host_ops->alloc() would then always have to return whole pages. This is why I proposed pci_ops->alloc() in the first commit.

However, your comment made me realize that the name "page_alloc" is easily misunderstood. What we need here is not page allocation per se, but memory suited for the PCI backend driver. I'm considering how to fix this and will apply the change in a couple of days.
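
To illustrate the non-IOMMU case, a backend could implement pci_ops->alloc() with Linux hugepages, which are physically contiguous within each huge page. This is only a sketch of the idea (resolving virtual-to-physical addresses, e.g. via /proc/self/pagemap, is omitted), not code from this PR:

    #include <stddef.h>
    #include <sys/mman.h>

    #define HUGE_PAGE_SIZE (2UL * 1024 * 1024)

    /* Round up to a 2 MiB huge page so each chunk is physically contiguous. */
    static void *hugepage_alloc(unsigned long size)
    {
            unsigned long len = (size + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            return p == MAP_FAILED ? NULL : p;
    }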

Member

Got it, thanks for the detailed explanation. pci_ops->alloc() sounds good (maybe later we can convert it into a host dma_alloc if there are other users).

One thing that is not clear to me is who will use this memory. I don't think we need to allocate bootmem with it; perhaps just internal PCI buffers?

Author

I totally agree that allocating the whole LKL memory with dma_alloc is not a sophisticated approach.
However, Linux components assume that memory allocated by kmalloc() or the other generic memory allocators is DMA-capable.

Documentation/DMA-API-HOWTO.txt says

If you acquired your memory via the page allocator
(i.e. __get_free_page*()) or the generic memory allocators
(i.e. kmalloc() or kmem_cache_alloc()) then you may DMA to/from
that memory using the addresses returned from those routines.

This makes things difficult, because it means that the DMA-capable memory allocators are often not called at all. (If they were, we could just hook dma_alloc() and return DMA-capable memory.)
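
To make the constraint concrete, this is the usual streaming-DMA pattern the DMA-API-HOWTO describes (a generic sketch, not code from the NVMe driver or this PR): the buffer comes straight from kmalloc(), so that memory must already be DMA-capable.

    #include <linux/dma-mapping.h>
    #include <linux/slab.h>

    static int example_dma_to_device(struct device *dev, size_t len)
    {
            void *buf = kmalloc(len, GFP_KERNEL);
            dma_addr_t handle;

            if (!buf)
                    return -ENOMEM;

            /* No DMA-specific allocator is involved; the mapping is set up
             * directly on the kmalloc()'ed buffer. */
            handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
            if (dma_mapping_error(dev, handle)) {
                    kfree(buf);
                    return -EIO;
            }

            /* ... hand "handle" to the device and wait for completion ... */

            dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
            kfree(buf);
            return 0;
    }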

There are some ways to solve this, but none of them are workable with modern device drivers such as NICs and NVMe.

The solutions I examined (and discarded) are:

  • Using zones to specify limited memory ranges for DMA. This is meant for devices with limited address spaces, but it does not work with the NVMe driver, which requires the full address space.
  • Bounce buffers. PCI drivers for some ARM SoCs implement this. It does not work either, because the NVMe driver requires coherent memory and does not invoke DMA sync operations.

So, IMO, the only way to implement this PR is to make the whole bootmem DMA-capable, even though this seems ugly.

tools/lkl/include/lkl.h (review thread resolved, outdated)
tools/lkl/lib/vfio_pci.c (review thread resolved)
tools/lkl/lib/vfio_pci.c (review thread resolved)
tools/lkl/tests/vfio-pci.c (review thread resolved, outdated)
tools/lkl/tests/vfio-pci.sh (review thread resolved, outdated)
tools/lkl/tests/vfio-pci.c (review thread resolved, outdated)
liva force-pushed the x86 branch 2 times, most recently from 3c42fda to dd0f447 on July 30, 2020 at 08:45
tools/lkl/lib/vfio_pci.c (review thread resolved)
tools/lkl/lib/vfio_pci.c (review thread resolved)
liva (Author) commented Aug 7, 2020

I created another PR against lkl/lkl-docker, which provides QEMU and a disk image for testing this feature.
In commit ff9cd1b they are generated in disk-vfio-pci.sh, which clearly slows down the overall test, so in commit 666c357 I improved this with a new Docker image hosted on my account.

As soon as the PR to lkl/lkl-docker is merged, I will create a new PR (or add a commit to this PR) to switch to that Docker image.

liva force-pushed the x86 branch 2 times, most recently from fa70c05 to 862ef0d on August 17, 2020 at 04:20
tavip (Member) commented Aug 19, 2020

LGTM. Could you also please rebase and remove the merge commit to keep the history clean? Thanks a lot for the hard work, Shinichi!

With regard to the page_alloc/DMA issue, I think it is something we can look into later. A combination of a DMA zone and dma_ops should work, but I don't know enough about how the NVMe driver operates to formulate a concrete suggestion. My rationale for not making bootmem DMA-capable is to avoid confusion in the host ops, since in some environments / LKL applications no DMA is required.
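
For reference, the dma_ops direction mentioned above could look roughly like the sketch below; lkl_host_dma_map() is a hypothetical host hook used only for illustration, not part of this PR:

    #include <linux/dma-mapping.h>
    #include <linux/mm.h>

    /* Hypothetical host hook: returns the bus address the PCI backend
     * (e.g. vfio) set up for this buffer. */
    extern dma_addr_t lkl_host_dma_map(void *vaddr, size_t size);

    static dma_addr_t lkl_dma_map_page(struct device *dev, struct page *page,
                                       unsigned long offset, size_t size,
                                       enum dma_data_direction dir,
                                       unsigned long attrs)
    {
            return lkl_host_dma_map(page_address(page) + offset, size);
    }

    static const struct dma_map_ops lkl_dma_ops = {
            .map_page = lkl_dma_map_page,
    };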

LKL is now able to manage PCI devices by itself, like DPDK!
This commit implements a PCI bus driver and a PCI device interface.
The commit also includes a vfio-pci backend driver as a reference
implementation, so no extra kernel module is needed if
the PCI device is assigned to VFIO.

I believe the vfio backend driver fulfills the needs of most use cases,
but users can inject wrapper functions for another userspace PCI
framework such as uio-pci-generic or a handmade kernel module.
In either case, the framework should provide physically contiguous memory
for DMA, because the kernel and some drivers (e.g. NVMe) assume this
memory is physically contiguous.

Signed-off-by: Shinichi Awamoto <shinichi.awamoto@gmail.com>
liva (Author) commented Aug 27, 2020

Thanks for all your support. The commits have been rebased and merged into a single commit. Also, .circleci/config.yaml now refers to lkldocker/circleci-qemu-x86_64:v1.1.

I agree that we need further discussion about better memory management. Ideally, we could fix the NVMe driver to allocate its memory from the DMA zone. This would enable us to separate bootmem and DMA memory completely.

tavip merged commit 1844b34 into lkl:master on Aug 28, 2020